iT邦幫忙

12th iThome Ironman Contest

DAY 11
AI & Data

Machine Learning series, part 11

Day11 - Feature Engineering -- 4. Discretization


4. Discretization

4.1 Equal width discretisation
4.2 Equal Frequency discretisation
4.3 Discretisation using decision trees

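Discretization sorts the values of a variable (either an ordinal categorical variable or a numeric variable) and assigns each value to an interval (also called a bin or bucket). The process is also known as binning.
The following methods can be used to discretize variables: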

4.1 Equal width discretisation

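This method places the values into N intervals of equal width. The width of each interval is determined by the range of the variable and the number of intervals.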

width = (max value - min value) / N
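
As a quick worked example of the formula, the sketch below computes the width and the bin boundaries for a small made-up list of ages. The values and the choice of N = 4 are assumptions picked for illustration; they are not taken from the Titanic data.

```python
# Hypothetical ages, chosen only to illustrate the formula.
ages = [2, 10, 18, 25, 33, 41, 42]
n_bins = 4  # N, the number of equal-width intervals

min_value, max_value = min(ages), max(ages)

# width = (max value - min value) / N
width = (max_value - min_value) / n_bins

# N intervals need N + 1 boundary values
edges = [min_value + i * width for i in range(n_bins + 1)]

print(width)  # 10.0
print(edges)  # [2.0, 12.0, 22.0, 32.0, 42.0]
```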

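There is no strict rule for choosing N, but as a rule of thumb it should not exceed 10. Also note that if the original data is skewed, this method does not improve the distribution: the bins have equal width, so most observations still pile up in a few of them.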
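
The point about skewed data can be seen on a tiny example. The sketch below uses made-up values (an assumption for illustration, not the Titanic data) in which one large value stretches the range, so nearly everything ends up in the first bin.

```python
import pandas as pd

# Made-up, right-skewed values: most are small, one is very large.
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 8, 100])

# Five equal-width bins over the observed range
binned = pd.cut(values, bins=5)

# Observations per bin, in bin order (empty bins are kept with a count of 0)
counts = binned.value_counts().sort_index()
print(counts)
```

Nine of the ten values fall in the first bin and three bins stay empty, so the skew of the original variable carries over to the discretized one.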

We can use pandas or scikit-learn to discretize the data.

The "Age" variable from the Kaggle Titanic dataset is used as the example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
import scipy.stats as stats
from sklearn.model_selection import train_test_split
# for discretization
from sklearn.preprocessing import KBinsDiscretizer

data = pd.read_csv('../input/titanic/train.csv', usecols=['Age', 'Fare','Survived'])
data

Rec-no. | Survived | Age | Fare
------- | -------- | --- | ----
0 | 0 | 22.0 | 7.2500
1 | 1 | 38.0 | 71.2833
2 | 1 | 26.0 | 7.9250
3 | 1 | 35.0 | 53.1000
4 | 0 | 35.0 | 8.0500
... | ... | ... | ...
886 | 0 | 27.0 | 13.0000
887 | 1 | 19.0 | 30.0000
888 | 0 | NaN | 23.4500
889 | 1 | 26.0 | 30.0000
890 | 0 | 32.0 | 7.7500

# first fill the missing data of the variable age, with a random sample of the variable

def impute_na(data, variable):
    # function to fill na with a random sample
    df = data.copy()
   
    # random sampling
    df[variable+'_random'] = df[variable]
    
    # extract the random sample to fill the na
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0)
    
    # pandas needs to have the same index in order to merge datasets
    random_sample.index = df[df[variable].isnull()].index
    df.loc[df[variable].isnull(), variable+'_random'] = random_sample
    
    return df[variable+'_random']
    
data['Age'] = impute_na(data, 'Age')
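
To make the effect of random-sample imputation easier to follow, here is a minimal sketch of the same steps on a toy column. The six values are made up for illustration; the logic mirrors impute_na above but is written inline.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values (made-up data).
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan, 35.0]})

observed = toy['Age'].dropna()
n_missing = toy['Age'].isnull().sum()

# Draw as many observed values as there are gaps.
random_sample = observed.sample(n_missing, random_state=0)

# Re-label the drawn values with the index of the missing rows,
# so that the assignment below lines up row by row.
random_sample.index = toy[toy['Age'].isnull()].index

filled = toy['Age'].copy()
filled.loc[filled.isnull()] = random_sample

print(filled)
```

Every gap is filled with a value that already occurs in the column, so the imputed variable keeps roughly the same distribution as the observed data.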

Split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(
    data[['Age', 'Fare', 'Survived']],
    data.Survived,
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape

((623, 3), (268, 3))

(1) Using pandas
Find the range of the data and cut it into 10 intervals of equal width.

age_range = X_train['Age'].max() - X_train['Age'].min()
print(age_range)

# divide the range into 10 equal-width bins
# (rounded to 3 decimals for display)
print(round(age_range / 10, 3))

79.58
7.958

min_value = int(np.floor( X_train['Age'].min()))
max_value = int(np.ceil( X_train['Age'].max()))

# let's round the bin width
inter_width = int(np.round(age_range/10))

min_value, max_value, inter_width

(0, 80, 8)

Find the boundary values of each interval:

intervals = list(range(min_value, max_value + inter_width, inter_width))

intervals

[0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80]

Write the interval of each observation into the age_disc column.

# discretise Age
X_train['age_disc'] = pd.cut(x=X_train['Age'],
                             bins=intervals,
                             include_lowest=True)

print(X_train[['Age', 'age_disc']].head(10))

Index | Age | age_disc
----- | --- | --------
857 | 51.0 | (48.0, 56.0]
52 | 49.0 | (48.0, 56.0]
386 | 1.0 | (-0.001, 8.0]
124 | 54.0 | (48.0, 56.0]
578 | 19.0 | (16.0, 24.0]
549 | 8.0 | (-0.001, 8.0]
118 | 24.0 | (16.0, 24.0]
12 | 20.0 | (16.0, 24.0]
157 | 30.0 | (24.0, 32.0]
127 | 24.0 | (16.0, 24.0]
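
The first interval is printed as (-0.001, 8.0] rather than (0, 8] because of include_lowest=True. The sketch below, on a few made-up ages, shows what that argument changes: pd.cut builds right-closed intervals by default, so a value equal to the lowest edge would otherwise fall outside every bin.

```python
import pandas as pd

# Made-up ages, including one exactly on the lowest edge.
ages = pd.Series([0.0, 8.0, 8.5, 16.0])
edges = [0, 8, 16]

# Default: intervals are (0, 8] and (8, 16], so 0.0 gets no bin (NaN).
default_cut = pd.cut(ages, bins=edges)

# include_lowest=True shifts the first left edge slightly below 0,
# so 0.0 lands in the first interval.
lowest_cut = pd.cut(ages, bins=edges, include_lowest=True)

print(default_cut.isnull().tolist())  # [True, False, False, False]
print(lowest_cut)
```

Note that a value sitting exactly on an inner edge, such as 8.0, belongs to the interval on its left, which matches row 549 in the table above.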

Check the number of observations in each interval:

# check the number of observations per bin

X_train['age_disc'].value_counts()

(16.0, 24.0]     146
(24.0, 32.0]     145
(32.0, 40.0]     116
(40.0, 48.0]      62
(-0.001, 8.0]     52
(48.0, 56.0]      34
(8.0, 16.0]       34
(56.0, 64.0]      24
(64.0, 72.0]       8
(72.0, 80.0]       2
Name: age_disc, dtype: int64

Plot a bar chart of the number of observations in each interval:

# plot the number of observations per bin

X_train.groupby('age_disc')['Age'].count().plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Number of observations per bin')

https://ithelp.ithome.com.tw/upload/images/20200911/20129584vwwqZ2TCDw.png

Discretize the test data using the intervals derived from the training set:

# discretise the variables in the test set
X_test['age_disc'] = pd.cut(x=X_test['Age'],
                            bins=intervals,
                            include_lowest=True)
X_test.head()
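
One thing to watch when reusing training-set boundaries: a test value outside the training range gets no bin. The sketch below uses made-up edges and ages (assumptions for illustration) to show this, together with one possible workaround of opening the outer bins with infinite edges. This workaround is a suggestion, not something the steps above do.

```python
import numpy as np
import pandas as pd

# Hypothetical edges learned from a training set.
train_edges = [0, 8, 16, 24]

# Hypothetical test ages; 30.0 lies beyond the last training edge.
test_ages = pd.Series([5.0, 20.0, 30.0])

# Reusing the training edges as-is: 30.0 becomes NaN.
strict = pd.cut(test_ages, bins=train_edges, include_lowest=True)

# Possible workaround: replace the outer edges with -inf / +inf
# so that the first and last bins catch out-of-range values.
open_edges = [-np.inf] + train_edges[1:-1] + [np.inf]
open_ended = pd.cut(test_ages, bins=open_edges)

print(strict.isnull().tolist())      # [False, False, True]
print(open_ended.cat.codes.tolist())  # [0, 2, 2]
```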

Compare how the observations are distributed across the intervals in the training and test sets.

# determine proportion of observations in each bin
t1 = X_train['age_disc'].value_counts() / len(X_train)
t2 = X_test['age_disc'].value_counts() / len(X_test)

# concatenate aggregated views
tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']

# plot
tmp.plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Proportion of observations per bin')

https://ithelp.ithome.com.tw/upload/images/20200911/201295848wuMoLVn0S.png

Plot of age versus survival rate in the original data.

fig = plt.figure()
# Series.plot() returns a matplotlib Axes, so name it accordingly
ax = X_train.groupby(['Age'])['Survived'].mean().plot()
ax.set_title('Normal relationship between Age and Survived')
ax.set_ylabel('Survived')

https://ithelp.ithome.com.tw/upload/images/20200911/20129584VqhNA7Q1XS.png

Plot of age versus survival rate after discretization.

fig = plt.figure()
# Series.plot() returns a matplotlib Axes, so name it accordingly
ax = X_train.groupby(['age_disc'])['Survived'].mean().plot(figsize=(12,6))
ax.set_title('Normal relationship between variable and target')
ax.set_ylabel('Survived')

https://ithelp.ithome.com.tw/upload/images/20200911/20129584NvjywvWmAa.png

(2) Using scikit-learn

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc.fit(X_train[['Age']])

KBinsDiscretizer(encode='ordinal', n_bins=10, strategy='uniform')

The boundary values of each interval are stored in disc.bin_edges_:

disc.bin_edges_

array([array([ 0.42 , 8.378, 16.336, 24.294, 32.252, 40.21 , 48.168, 56.126,
64.084, 72.042, 80. ])], dtype=object)
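
These boundaries are simply evenly spaced points between the minimum and maximum of the fitted column. The sketch below checks this on a handful of made-up ages (an assumption for illustration; only the minimum 0.42 and maximum 80 match the values shown above) by comparing bin_edges_ with np.linspace.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up ages with the same minimum (0.42) and maximum (80) as above.
X = np.array([[0.42], [5.0], [22.0], [38.0], [51.0], [80.0]])

disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc.fit(X)

# strategy='uniform' places n_bins + 1 evenly spaced edges between min and max
expected_edges = np.linspace(X.min(), X.max(), 10 + 1)
print(np.allclose(disc.bin_edges_[0], expected_edges))  # True

# Ordinal codes: the index of the bin each value falls into
codes = disc.transform(X).ravel()
print(codes)
```

Unlike the pandas approach, where the boundaries were rounded to whole numbers (0, 8, 16, ...), the scikit-learn edges start exactly at the observed minimum, which is why the two sets of boundaries differ slightly.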

train_t = disc.transform(X_train[['Age']])
train_t = pd.DataFrame(train_t, columns = ['Age'])
test_t = disc.transform(X_test[['Age']])
test_t = pd.DataFrame(test_t, columns = ['Age'])
train_t.head()

Index | Age
----- | ---
0 | 6.0
1 | 6.0
2 | 0.0
3 | 6.0
4 | 2.0

Compare how the observations are distributed across the intervals in the training and test sets.

t1 = train_t.groupby(['Age'])['Age'].count() / len(train_t)
t2 = test_t.groupby(['Age'])['Age'].count() / len(test_t)

tmp = pd.concat([t1, t2], axis=1)
tmp.columns = ['train', 'test']
tmp.plot.bar()
plt.xticks(rotation=45)
plt.ylabel('Proportion of observations per bin')

https://ithelp.ithome.com.tw/upload/images/20200911/20129584Vw53azEw1u.png

